About Dataset: The dataset is obtained from Kaggle.
The chosen cities are Riyadh, Jeddah, Dammam, and Khobar.
Riyadh is the capital and largest city of Saudi Arabia, with the largest municipal population in the Middle East. It is home to a diverse range of people and cultures and is still growing day by day.
Jeddah is located in the middle of the eastern coast of the Red Sea and is considered the economic and tourism capital of the country.
Dammam lies on the Persian Gulf northwest of Bahrain Island and forms a larger metropolitan and industrial complex with Khobar, Qatif, and Dhahran.
Khobar is one of the three main cities in the Eastern Province, the others being Dammam and Dhahran. It is developing into an important industrial city, with factories producing industrial gas, dairy products, carbonated water, tissue paper, and ready-made garments.
Columns Definition:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from googletrans import Translator
#Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
from sklearn.svm import SVR
# Features selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
#Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
#pipeline
from sklearn.pipeline import Pipeline
DATA DESCRIPTION AND DATA CLEANING
# Import the dataset
df_Aqar = pd.read_csv('./SA_Aqar.csv')
# Dimension of the data:
print(f'The dataset contains {df_Aqar.shape[0]} rows and {df_Aqar.shape[1]} columns')
The dataset contains 3718 rows and 24 columns
# Print two records
df_Aqar.head(2)
| city | district | front | size | property_age | bedrooms | bathrooms | livingrooms | kitchen | garage | ... | roof | pool | frontyard | basement | duplex | stairs | elevator | fireplace | price | details | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | الرياض | حي العارض | شمال | 250 | 0 | 5 | 5 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 80000 | للايجار فيلا دبلكس في موقع ممتاز جدا بالقرب من... |
| 1 | الرياض | حي القادسية | جنوب | 370 | 0 | 4 | 5 | 2 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 60000 | *** فيلا درج مع الصالة جديدة ***\n\nعبارة عن م... |
2 rows × 24 columns
The Arabic text was translated to English (using the `googletrans` Translator imported above) and the result was saved as `SA_Aqar-translated.csv`; load the translated version:
# Import the translated dataset
df = pd.read_csv('./SA_Aqar-translated.csv')
# Dimension of the data:
print(f'The dataset contains {df.shape[0]} rows and {df.shape[1]} columns')
The dataset contains 3718 rows and 25 columns
# print all columns
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df.head(2))
| Unnamed: 0 | city | district | front | size | property_age | bedrooms | bathrooms | livingrooms | kitchen | garage | driver_room | maid_room | furnished | ac | roof | pool | frontyard | basement | duplex | stairs | elevator | fireplace | price | details | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Riyadh | Al-Arid neighborhood | North | 250 | 0 | 5 | 5 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 80000 | للايجار فيلا دبلكس في موقع ممتاز جدا بالقرب من... |
| 1 | 1 | Riyadh | Qadisiyah neighborhood | South | 370 | 0 | 4 | 5 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 60000 | *** فيلا درج مع الصالة جديدة ***\n\nعبارة عن م... |
#Check for duplicate records
print('The dataset has', df.duplicated().sum(), 'duplicate records')
The dataset has 0 duplicate records
#Check for null values
df.isna().sum()
Unnamed: 0      0
city            0
district        0
front           0
size            0
property_age    0
bedrooms        0
bathrooms       0
livingrooms     0
kitchen         0
garage          0
driver_room     0
maid_room       0
furnished       0
ac              0
roof            0
pool            0
frontyard       0
basement        0
duplex          0
stairs          0
elevator        0
fireplace       0
price           0
details        80
dtype: int64
Only the `details` column contains null values (80 of them); it will be dropped, along with the `Unnamed: 0` index column, because neither is needed for modeling.
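A minimal sketch (on a toy frame, not the real data) of how the null check and the column drop in the next cells work:

```python
import pandas as pd

# Toy frame mimicking the structure: one text column with missing values
toy = pd.DataFrame({"price": [80000, 60000, 70000],
                    "details": ["villa", None, None]})

# Count nulls per column, then drop the unneeded text column entirely
null_counts = toy.isna().sum()
cleaned = toy.drop(columns=["details"])
```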
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3718 entries, 0 to 3717
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0    3718 non-null   int64
 1   city          3718 non-null   object
 2   district      3718 non-null   object
 3   front         3718 non-null   object
 4   size          3718 non-null   int64
 5   property_age  3718 non-null   int64
 6   bedrooms      3718 non-null   int64
 7   bathrooms     3718 non-null   int64
 8   livingrooms   3718 non-null   int64
 9   kitchen       3718 non-null   int64
 10  garage        3718 non-null   int64
 11  driver_room   3718 non-null   int64
 12  maid_room     3718 non-null   int64
 13  furnished     3718 non-null   int64
 14  ac            3718 non-null   int64
 15  roof          3718 non-null   int64
 16  pool          3718 non-null   int64
 17  frontyard     3718 non-null   int64
 18  basement      3718 non-null   int64
 19  duplex        3718 non-null   int64
 20  stairs        3718 non-null   int64
 21  elevator      3718 non-null   int64
 22  fireplace     3718 non-null   int64
 23  price         3718 non-null   int64
 24  details       3638 non-null   object
dtypes: int64(21), object(4)
memory usage: 726.3+ KB
# Dropping unneeded column:
df.drop(['Unnamed: 0', 'details'], axis = 1, inplace = True)
#Save the clean version for the dashboard
#df.to_csv("Cleaned_SA_Real_Estate.csv")
df.describe()
| size | property_age | bedrooms | bathrooms | livingrooms | kitchen | garage | driver_room | maid_room | furnished | ac | roof | pool | frontyard | basement | duplex | stairs | elevator | fireplace | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3718.000000 | 3.718000e+03 |
| mean | 390.968531 | 5.064820 | 5.083916 | 4.606509 | 2.243948 | 0.909360 | 0.802044 | 0.495697 | 0.795320 | 0.123453 | 0.560785 | 0.521517 | 0.162453 | 0.802582 | 0.034158 | 0.499462 | 0.814416 | 0.080958 | 0.181280 | 8.738797e+04 |
| std | 1565.056135 | 7.590427 | 1.230040 | 0.703449 | 0.916436 | 0.287135 | 0.398512 | 0.500049 | 0.403522 | 0.329001 | 0.496358 | 0.499604 | 0.368915 | 0.398104 | 0.181660 | 0.500067 | 0.388823 | 0.272807 | 0.385302 | 7.063470e+04 |
| min | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+03 |
| 25% | 280.000000 | 0.000000 | 4.000000 | 4.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 5.500000e+04 |
| 50% | 330.000000 | 2.000000 | 5.000000 | 5.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 7.000000e+04 |
| 75% | 400.000000 | 7.000000 | 6.000000 | 5.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000e+05 |
| max | 95000.000000 | 36.000000 | 7.000000 | 5.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.700000e+06 |
df.describe(exclude = 'number')
| city | district | front | |
|---|---|---|---|
| count | 3718 | 3718 | 3718 |
| unique | 4 | 174 | 10 |
| top | the news | King Fahd Suburb District | North |
| freq | 976 | 173 | 917 |
# Fixing names after the automatic translation
df.replace({'front' : {'the West' : 'West', 'Southwest' : 'South West', 'Northwest': 'North West'}},inplace = True)
df.replace({'city' : {'grandmother' : 'Jeddah', 'the news' : 'Khobar'}}, inplace = True)
### Removing outliers
# Define a method to remove outliers using the 1.5 * IQR rule
def remove_outlier(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    out_df = df.loc[(df[col] > lower_bound) & (df[col] < upper_bound)]
    return out_df
# To plot before and after, remove the outliers into df2
# (note: the second call must chain on df2, not df, so both filters apply)
df2 = remove_outlier(df, "price")
df2 = remove_outlier(df2, "size")
plt.figure(figsize=(12,6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df['price'])
plt.title('Price Before Removing Outliers')
plt.subplot(1, 2, 2)
sns.boxplot(data=df2['price'])
plt.title('Price After Removing Outliers')
plt.tight_layout()
plt.show()
plt.figure(figsize=(12,6))
plt.subplot(1, 2, 1)
sns.boxplot(data=df['size'])
plt.title('Size Before Removing Outliers')
plt.subplot(1, 2, 2)
sns.boxplot(data=df2['size'])
plt.title('Size After Removing Outliers')
plt.tight_layout()
plt.show()
print('Dataset shape before removing outliers:', df.shape)
print('Dataset shape after removing outliers:', df2.shape)
Dataset shape before removing outliers: (3718, 23)
Dataset shape after removing outliers: (3320, 23)
#Now we can remove outliers from the original df
df = remove_outlier(df, "price")
df = remove_outlier(df, "size")
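To see exactly which rows the 1.5 × IQR rule keeps, here is a self-contained sketch on toy prices, using the same strict-inequality filter as `remove_outlier` above:

```python
import pandas as pd

def iqr_filter(df, col):
    # Same 1.5 * IQR rule as remove_outlier (strict inequalities)
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df.loc[(df[col] > lower) & (df[col] < upper)]

# One extreme value among otherwise similar prices
toy = pd.DataFrame({"price": [50, 55, 60, 65, 70, 1000]})
kept = iqr_filter(toy, "price")  # the 1000 row falls outside the upper bound
```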
corrMatrix = df.corr(numeric_only=True)  # correlations over numeric columns only
plt.figure(figsize=(18,12))
sns.heatmap(corrMatrix, annot=True, cmap='Blues')
plt.title("Correlation Heatmap")
plt.show()
The correlation matrix shows a positive correlation between our target, price, and features such as ac, driver_room, and maid_room, and a negative correlation between price and fireplace and bedrooms.
plt.figure(figsize=(10,5))
s = sns.boxplot(x='price',y='city', data=df, palette='Blues',showfliers=False)
plt.title('Five-Number Summary of Real Estate Prices')
plt.xlabel('Prices in (SAR)')
plt.ylabel('City')
sns.set_theme(style="white")
sns.despine()
plt.show()
The box plot shows the price ranges in the four cities. In Riyadh, for example, prices have a median of about 80k SAR, a 25th percentile near 50k SAR, and a 75th percentile of approximately 100k SAR. Jeddah shows higher prices than Riyadh, while Dammam and Khobar show the lowest price ranges.
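The quartiles read off a box plot can also be computed directly with `groupby` and `quantile`. A sketch with illustrative prices (toy values, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Riyadh"] * 5 + ["Jeddah"] * 5,
                    "price": [40000, 50000, 80000, 100000, 120000,
                              60000, 80000, 90000, 110000, 150000]})

# Quartiles per city: the numbers a box plot draws as box edges and median line
summary = toy.groupby("city")["price"].quantile([0.25, 0.5, 0.75]).unstack()
```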
plt.figure(figsize=(10,5))
s = sns.boxplot(x='price',y='front', data=df, palette='Blues',showfliers=False)
plt.title('Real Estate Prices per Front')
plt.xlabel('Prices in (SAR)')
plt.ylabel('Front')
sns.set_theme(style="white")
sns.despine()
plt.show()
The box plot shows that properties facing North East or fronting 4 streets have the highest price ranges; hence we can say that the front affects the property price.
df3 =df.groupby(['property_age','price']).size().reset_index()
plt.figure(figsize=(15,6))
sns.countplot(data = df3, x = 'property_age', palette='Blues')
plt.title('Prices Per Property Age',fontsize=16)
plt.xlabel('Property Age')
plt.ylabel('Frequency')
sns.set_theme(style="white")
sns.despine()
plt.show()
The count plot shows how frequently each property age appears among the listings.
plt.figure(figsize=(12,6))
ax =sns.scatterplot(data=df, x='size', y='price', hue='city',
sizes=(20, 200), legend="full", edgecolor='black', alpha=0.5)
ax.set_xlim(0,800)
ax.set_ylim(0,200000)
plt.title('Relation Between Property Size and Price By City',fontsize=16)
plt.xlabel('Property Size m2')
plt.ylabel('Price in (SAR)')
sns.set_theme(style="white")
sns.despine()
plt.show()
We can see a positive correlation between property size and price (in Saudi Riyal) in each city. For readability, we limited the size and price axes.
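The visual trend can be quantified with the Pearson correlation coefficient; a sketch on illustrative values (not the real data):

```python
import pandas as pd

# Toy data with an upward size -> price trend (illustrative values only)
toy = pd.DataFrame({"size": [200, 300, 400, 500],
                    "price": [40000, 60000, 75000, 100000]})

# Pearson correlation coefficient between size and price
r = toy["size"].corr(toy["price"])
```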
plt.figure(figsize=(12,6))
sns.countplot(data= df, x=df['bedrooms'],palette='Blues')
plt.title("Number of Bedrooms in Properties ",fontsize=16)
plt.ylabel('Number of properties')
sns.set_theme(style="white")
sns.despine()
plt.show()
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x="city", y="property_age")
plt.xticks(rotation = 70, fontsize = 9)
plt.title('Property Age per City',fontsize=16)
plt.ylabel('Property Age')
plt.xlabel('City')
sns.set_theme(style="white")
sns.despine()
plt.show()
The plot shows property ages in Riyadh, Jeddah, Dammam, and Khobar: the oldest properties are in Riyadh, while Dammam has the youngest.
plt.figure(figsize=(20,10))
plt.subplot(231)
sns.scatterplot(data=df, x="size", y="price", hue="duplex")
plt.title('Duplex Price and Size Relation',fontsize=12)
plt.subplot(232)
sns.scatterplot(data=df, x="city", y="price", hue="duplex")
plt.title('Duplex Price per City',fontsize=12)
plt.show()
We used size and city to compare duplex and non-duplex prices.
plt.figure(figsize=(10,6))
sns.countplot(y = 'front', data=df, palette='Blues')
plt.title('Property counts per Front',fontsize=16)
plt.ylabel('Front')
sns.set_theme(style="white")
sns.despine()
plt.show()
plt.figure(figsize=(10,6))
sns.barplot(x="city", y="size", data=df, palette="Blues")
plt.xticks([0,1,2,3],['Khobar','Riyadh', 'Dammam', 'Jeddah'],rotation=0)
plt.xlabel('City')
plt.ylabel('Size m2')
plt.title("The size by city",fontsize=16)
sns.set_theme(style="white")
sns.despine()
plt.show()
The plot shows property sizes across the cities: the largest sizes are in Jeddah, while the remaining cities have roughly the same range of property sizes.
#Mean of property price per city
segments= df.groupby('city')['price'].mean()
# pie plot
fig = px.pie(segments,
values=segments.values,
names=segments.index,
title="Property Average Price per City",
template="seaborn",
color_discrete_sequence=px.colors.sequential.Blues)
fig.update_traces(rotation=-90, textinfo="percent+label")
fig.show()
The pie chart shows the average price per city: Jeddah has the highest average price, followed by Riyadh and Khobar, while Dammam comes last with the lowest prices.
top_corr = df[['bedrooms', 'livingrooms', 'bathrooms', 'driver_room','maid_room']].copy()
pd.plotting.scatter_matrix(top_corr, figsize=(15,10),hist_kwds={'bins':30},diagonal='kde');
This matrix shows the top 5 correlated features that are expected to affect the real-estate price.
p_age = pd.DataFrame({"Data": df['property_age']})
def change_func(x):
    # Note: the original conditions (1 < x and 6 < x) let ages 1 and 6 fall through to "Old"
    if x == 0:
        return "New"
    elif x <= 5:
        return "Semi new"
    elif x <= 15:
        return "Semi old"
    else:
        return "Old"
p_age["age"] = p_age["Data"].apply(change_func)
p_age.drop(['Data'], axis = 1, inplace = True)
# value_counts() sorts by frequency, so reindex to keep the bar order consistent with the tick labels
p_age["age"].value_counts().reindex(['New', 'Semi new', 'Semi old', 'Old']).plot(kind='bar', figsize=(10,6))
plt.xticks([0,1,2,3],['New','Semi New', 'Semi Old', 'Old'],rotation=0)
plt.xlabel('Property Age')
plt.ylabel('Number of Properties')
plt.title("Distribution of properties by age",fontsize=16)
plt.show()
This plot shows the distribution of available properties according to age: New = 0 years; Semi New = 1-5 years; Semi Old = 6-15 years; Old = more than 15 years.
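The same binning can be written more compactly with `pd.cut`, which assigns each age to a labeled interval (bin edges follow the definition above):

```python
import pandas as pd

ages = pd.Series([0, 3, 10, 20])
labels = ["New", "Semi new", "Semi old", "Old"]

# Right-inclusive bins: (-1, 0] -> New, (0, 5] -> Semi new,
# (5, 15] -> Semi old, (15, inf) -> Old
binned = pd.cut(ages, bins=[-1, 0, 5, 15, float("inf")], labels=labels)
```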
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select all features
    fs = SelectKBest(score_func=f_regression, k='all')
    # learn the relationship from the training data
    fs.fit(X_train, y_train)
    # transform the train input data
    X_train_fs = fs.transform(X_train)
    # transform the test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# feature selection (note: X_train, y_train, and X_test are created by the train/test split further below)
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.show()
Feature 0: 94.683233
Feature 1: 59.595086
Feature 2: 39.821574
Feature 3: 31.952137
Feature 4: 139.912831
Feature 5: 35.822266
Feature 6: 33.464352
Feature 7: 644.988249
Feature 8: 266.530346
Feature 9: 11.700623
Feature 10: 639.981410
Feature 11: 9.233451
Feature 12: 149.460354
Feature 13: 15.425919
Feature 14: 192.538146
Feature 15: 0.387854
Feature 16: 75.302043
Feature 17: 0.504129
Feature 18: 112.395245
Feature 19: 147.405810
Feature 20: 133.020524
Feature 21: 0.846316
Feature 22: 5.386716
Feature 23: 0.208209
Feature 24: 10.931836
Feature 25: 1.154821
Feature 26: 22.734401
Feature 27: 48.102426
Feature 28: 10.293771
Feature 29: 109.253240
Feature 30: 0.679751
Feature 31: 1.551421
Feature 32: 25.775714
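The scores above are printed by column position; pairing them with column names makes the ranking readable. A sketch on toy features (the names `x1`/`x2` and values are illustrative, not the real columns):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy features: x1 drives the target, x2 is shuffled noise
X_toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0],
                      "x2": [5.0, 1.0, 4.0, 2.0, 3.0]})
y_toy = pd.Series([10.0, 20.0, 31.0, 39.0, 50.0])

fs = SelectKBest(score_func=f_regression, k="all").fit(X_toy, y_toy)

# Attach F-scores to column names and rank them
scores = pd.Series(fs.scores_, index=X_toy.columns).sort_values(ascending=False)
```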
#Dummies for the categorical variables
pd.get_dummies(df['city']).head(2)
| Dammam | Jeddah | Khobar | Riyadh | |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
pd.get_dummies(df['front']).head(2)
| 3 streets | 4 streets | East | North | North East | North West | South | South East | South West | West | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pd.get_dummies(df['district']).head(2)
| Al Alia District | Al Amwaj District | Al Anwar District | Al Areej Al Gharbia Naghbarhad | Al Badiah District | Al Bawadi dystorka | Al Faisaliah neighborhood | Al Falah District | Al Ha'ir District | Al Hamdaniyah Naghbarhad | ... | safety district | sailed the southern district | sailed to the north of naghbarhad | sand district | seagull nagbarhad | ship district | tattoo ngbrhd | the adjective of dysregulation | the happiness neighborhood | torch district | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 165 columns
# Get dummy values for the categorical columns and merge them into the data frame after dropping the original ones
df_final = pd.concat([df.drop(['district','front','city'], axis = 1),
pd.get_dummies(df['city']),pd.get_dummies(df['front'])], axis=1)
df_final
| size | property_age | bedrooms | bathrooms | livingrooms | kitchen | garage | driver_room | maid_room | furnished | ... | 3 streets | 4 streets | East | North | North East | North West | South | South East | South West | West | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250 | 0 | 5 | 5 | 1 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 370 | 0 | 4 | 5 | 2 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 380 | 0 | 4 | 5 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 250 | 0 | 5 | 5 | 3 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 400 | 11 | 7 | 5 | 2 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3713 | 437 | 0 | 7 | 5 | 2 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3714 | 400 | 0 | 5 | 5 | 3 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3715 | 330 | 0 | 6 | 4 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3716 | 300 | 13 | 6 | 5 | 2 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3717 | 437 | 0 | 7 | 5 | 2 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3320 rows × 34 columns
# Our dataset has now grown to 34 columns
df_final.shape
(3320, 34)
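A side note on the dummy encoding above: for linear models, keeping one column per category is redundant (the "dummy variable trap"), since the last level is implied by the others; `pd.get_dummies(..., drop_first=True)` removes one level per category. A minimal sketch on illustrative city names:

```python
import pandas as pd

cities = pd.Series(["Riyadh", "Jeddah", "Riyadh", "Dammam"])

full = pd.get_dummies(cities)                      # one column per city
reduced = pd.get_dummies(cities, drop_first=True)  # first level dropped (implied by the rest)
```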
#Reorder columns
df_final = df_final[['size', 'property_age', 'bedrooms', 'bathrooms', 'livingrooms',
'kitchen', 'garage', 'driver_room', 'maid_room', 'furnished', 'ac',
'roof', 'pool', 'frontyard', 'basement', 'duplex', 'stairs', 'elevator',
'fireplace', 'Dammam', 'Jeddah', 'Khobar', 'Riyadh',
'3 streets', '4 streets', 'East', 'North', 'North East', 'North West',
'South', 'South East', 'South West', 'West','price']]
#Selecting X and y features
X = df_final.drop(['price'], axis=1)
y = df_final.price
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Check the shapes of the resulting subsets
print('X_train shape:',X_train.shape)
print('y_train shape:',y_train.shape)
print('-'*25)
print('X_test shape:',X_test.shape)
print('y_test shape:',y_test.shape)
X_train shape: (2656, 33)
y_train shape: (2656,)
-------------------------
X_test shape: (664, 33)
y_test shape: (664,)
# Instantiate Scaler Object
scaler = StandardScaler()
# Apply to X data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
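The scaler is fitted on the training split only, so no statistics from the test set leak into training; the transformed training data then has (approximately) zero mean and unit standard deviation per column. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(200, 2))
test = rng.normal(loc=100.0, scale=20.0, size=(50, 2))

scaler = StandardScaler().fit(train)   # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # test data reuses the training statistics
```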
# Train the model
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
lr_preds = lr.predict(X_test_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=lr_preds)))
R squared score: 0.47
# Note: logistic regression treats each distinct price value as a class; it is included here for comparison only
lg = LogisticRegression(max_iter=1000)
lg.fit(X_train_scaled, y_train)
lg_preds = lg.predict(X_test_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=lg_preds)))
R squared score: 0.53
# n_neighbors: int, default=5
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_preds = knn.predict(X_test_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=knn_preds)))
R squared score: 0.64
reg_forest = RandomForestRegressor(n_estimators = 12, random_state = 42, criterion='squared_error')
reg_forest.fit(X_train, y_train)
preds_forest = reg_forest.predict(X_test)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=preds_forest)))
R squared score: 0.72
# Score of train and test sets
score_on_train = reg_forest.score(X_train, y_train)
score_on_test = reg_forest.score(X_test, y_test)
print(score_on_train)
print(score_on_test)
0.9507074225651775
0.7184519533323981
reg_tree = DecisionTreeRegressor(random_state = 42, max_depth= 4, criterion='squared_error')
reg_tree.fit(X_train, y_train)
preds_tree = reg_tree.predict(X_test)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=preds_tree)))
R squared score: 0.55
XGBoost with a linear base learner:
# Convert the training and testing sets into DMatrices
df_train = xgb.DMatrix(data=X_train, label=y_train)
df_test = xgb.DMatrix(data=X_test, label=y_test)
# Parameters with the booster set to gblinear (linear base learner)
params = {"booster": "gblinear", "objective": "reg:squarederror"}
xg_reg = xgb.train(params=params, dtrain=df_train, num_boost_round=5)
# Making predictions
predictions = xg_reg.predict(df_test)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=predictions)))
R squared score: 0.37
reg_svr = SVR(kernel = 'linear')
reg_svr.fit(X_train, y_train)
preds_svr = reg_svr.predict(X_test)
print('R squared score: {:.2f}'.format(r2_score(y_true=y_test, y_pred=preds_svr)))
R squared score: 0.03
# Correlation of each column with price; we will keep only the features with the strongest positive and negative correlations.
numeric_columns=df_final.select_dtypes(include=[np.number])
corr=numeric_columns.corr()
print(corr['price'].sort_values(ascending=False))
price           1.000000
ac              0.445123
driver_room     0.443963
maid_room       0.302379
basement        0.254777
livingrooms     0.239986
pool            0.233973
Jeddah          0.212184
South           0.197278
size            0.179532
stairs          0.160606
North East      0.147179
property_age    0.142279
bathrooms       0.117043
kitchen         0.116730
garage          0.109180
frontyard       0.086338
furnished       0.076449
4 streets       0.061030
Riyadh          0.046653
elevator        0.024325
3 streets       0.017398
duplex         -0.010622
Khobar         -0.014675
South East     -0.016958
East           -0.021712
South West     -0.030463
North West     -0.055776
roof           -0.056249
North          -0.091523
West           -0.105612
bedrooms       -0.117205
fireplace      -0.187551
Dammam         -0.228102
Name: price, dtype: float64
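One way to automate this kind of selection is a threshold on the absolute correlation with the target. A sketch using a few rounded, illustrative values from the ranking (not the full table):

```python
import pandas as pd

# A handful of illustrative correlations with price (rounded)
toy_corr = pd.Series({"ac": 0.44, "driver_room": 0.44, "bedrooms": -0.12,
                      "elevator": 0.02, "fireplace": -0.19, "price": 1.0})

# Keep features (excluding the target itself) whose |correlation| clears the threshold
threshold = 0.1
feat = toy_corr.drop("price")
selected = feat[feat.abs() > threshold].index.tolist()
```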
X0 = df_final.drop(['bedrooms', 'bathrooms', 'kitchen', 'garage','roof', 'frontyard', 'duplex',
'stairs', 'elevator', 'fireplace', 'price',
'Riyadh', 'Khobar',
'4 streets', 'East', 'North', 'North East',
'North West', 'South', 'South East', 'South West',
'West'], axis=1)
y0 = df_final.price
# Split the dataset into train and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X0, y0, test_size=0.2, random_state=42)
# Check the shapes of the resulting subsets
print('X_train shape:',Xtrain.shape)
print('y_train shape:',ytrain.shape)
print('-'*25)
print('X_test shape:',Xtest.shape)
print('y_test shape:',ytest.shape)
X_train shape: (2656, 12)
y_train shape: (2656,)
-------------------------
X_test shape: (664, 12)
y_test shape: (664,)
Xtrain.head()
| size | property_age | livingrooms | driver_room | maid_room | furnished | ac | pool | basement | Dammam | Jeddah | 3 streets | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3070 | 300 | 13 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 521 | 246 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2362 | 250 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3038 | 312 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 601 | 200 | 24 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ytrain.head()
3070    90000
521     40000
2362    35000
3038    45000
601     20000
Name: price, dtype: int64
# Instantiate Scaler Object
scaler = StandardScaler()
Xtrain_scaled = scaler.fit_transform(Xtrain)
Xtest_scaled = scaler.transform(Xtest)
lr2 = LinearRegression()
lr2.fit(Xtrain_scaled, ytrain)
lr_preds2 = lr2.predict(Xtest_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=lr_preds2)))
R squared score: 0.43
lg2 = LogisticRegression( max_iter= 1000)
lg2.fit(Xtrain_scaled, ytrain)
lg2_preds = lg2.predict(Xtest_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=lg2_preds)))
R squared score: 0.35
knn2 = KNeighborsRegressor(n_neighbors=7)
knn2.fit(Xtrain_scaled, ytrain)
knn_preds2 = knn2.predict(Xtest_scaled)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=knn_preds2)))
R squared score: 0.69
reg_forest2 = RandomForestRegressor(n_estimators = 10, random_state = 42, criterion='squared_error')
reg_forest2.fit(Xtrain, ytrain)
preds_forest2 = reg_forest2.predict(Xtest)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=preds_forest2)))
R squared score: 0.68
# Plot Feature Importances to Visualize better
pd.DataFrame(dict(zip(Xtrain.columns, reg_forest2.feature_importances_)), index = [1])\
.T\
.sort_values(1, ascending=True)\
.plot(kind="barh", title="Feature Importances (%)");
reg_tree2 = DecisionTreeRegressor(random_state = 42, max_depth= 5, criterion='squared_error')
reg_tree2.fit(Xtrain, ytrain)
preds_tree2 = reg_tree2.predict(Xtest)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=preds_tree2)))
R squared score: 0.62
# Convert the training and testing sets into DMatrices
df_train2 = xgb.DMatrix(data=Xtrain, label=ytrain)
df_test2 = xgb.DMatrix(data=Xtest, label=ytest)
# Parameters with the booster set to gblinear (linear base learner)
params = {"booster": "gblinear", "objective": "reg:squarederror"}
# Train the model
xg_reg2 = xgb.train(params=params, dtrain=df_train2, num_boost_round=5)
# Making predictions
xg_preds = xg_reg2.predict(df_test2)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=xg_preds)))
R squared score: 0.42
reg_svr2 = SVR(kernel = 'linear')
reg_svr2.fit(Xtrain, ytrain)
preds_svr2 = reg_svr2.predict(Xtest)
print('R squared score: {:.2f}'.format(r2_score(y_true=ytest, y_pred=preds_svr2)))
R squared score: 0.01
# Calculates the cost functions
def calc_cost(y_true, y_predict):
    "Calculate cost functions and return the results as a dict"
    result_dict = {}
    mae = mean_absolute_error(y_true, y_predict)
    mape = mean_absolute_percentage_error(y_true, y_predict)
    rmse = mean_squared_error(y_true, y_predict, squared=False)
    # Round the numbers for readability
    ls = [round(mae, 2), round(mape, 3), round(rmse, 2)]
    ls2 = ["MAE", 'MAPE', "RMSE"]
    for x in range(len(ls)):
        result_dict[ls2[x]] = ls[x]
    return result_dict
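The three cost functions can be sanity-checked by hand on a tiny example; a NumPy-only sketch of the same formulas (MAE, MAPE, RMSE):

```python
import numpy as np

y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 190.0])

mae = np.abs(y_true - y_pred).mean()                       # mean absolute error
mape = (np.abs(y_true - y_pred) / np.abs(y_true)).mean()   # mean absolute percentage error
rmse = np.sqrt(((y_true - y_pred) ** 2).mean())            # root mean squared error
```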
print('\033[1m'+ "Models Evaluation for All Features")
print("-"*80+ '\033[0m')
# The baseline model: predict the mean for every observation and compute the cost functions
b_preds = [y_test.mean() for x in range(len(y_test))]
scores_ls = [calc_cost(y_test, b_preds),
calc_cost(y_test, lr.predict(X_test_scaled)),
calc_cost(y_test, lg.predict(X_test_scaled)),
calc_cost(y_test, knn.predict(X_test_scaled)),
calc_cost(y_test, reg_forest.predict(X_test)),
calc_cost(y_test, reg_tree.predict(X_test)),
calc_cost(y_test, xg_reg.predict(df_test)),
calc_cost(y_test, reg_svr.predict(X_test))]
res_df = pd.DataFrame(scores_ls).transpose()
res_df.columns=["Baseline", "Linear Regression", "Logistic Regressor", "KNN Regressor",
"Random Forest", "Decision Tree Regressor", "XGBoost Regressor", "SVR Regressor"]
res_df
Models Evaluation for All Features
--------------------------------------------------------------------------------
| Baseline | Linear Regression | Logistic Regressor | KNN Regressor | Random Forest | Decision Tree Regressor | XGBoost Regressor | SVR Regressor | |
|---|---|---|---|---|---|---|---|---|
| MAE | 26280.540 | 17958.040 | 11101.200 | 9835.370 | 8442.900 | 15791.830 | 20445.810 | 24790.85 |
| MAPE | 0.562 | 0.401 | 0.194 | 0.252 | 0.216 | 0.358 | 0.466 | 0.50 |
| RMSE | 33311.610 | 24204.110 | 22766.490 | 20005.980 | 17675.500 | 22435.930 | 26477.160 | 32844.36 |
# Actual vs. predicted values for the random forest model
mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': preds_forest})
mlr_diff.head()
| Actual value | Predicted value | |
|---|---|---|
| 1524 | 150000 | 150000.000000 |
| 2613 | 60000 | 60000.000000 |
| 2961 | 85000 | 85000.000000 |
| 912 | 75000 | 61694.416667 |
| 3634 | 150000 | 150000.000000 |
# Calculates the cost functions (redefined with whole-number rounding for this table)
def calc_cost(y_true, y_predict):
    "Calculate cost functions and return the results as a dict"
    result_dict = {}
    mae = mean_absolute_error(y_true, y_predict)
    mape = mean_absolute_percentage_error(y_true, y_predict)
    rmse = mean_squared_error(y_true, y_predict, squared=False)
    # Round the numbers for readability
    ls = [round(mae), mape, round(rmse)]
    ls2 = ["MAE", 'MAPE', "RMSE"]
    for x in range(len(ls)):
        result_dict[ls2[x]] = ls[x]
    return result_dict
print('\033[1m'+ 'Models Evaluation For Selected Features')
print("-"*80+ '\033[0m')
# Baseline model
b_preds2 = [ytest.mean() for x in range(len(ytest))]
# Create a list of cost function results
scores2_ls = [calc_cost(ytest, b_preds2),
calc_cost(ytest, lr2.predict(Xtest_scaled)),
calc_cost(ytest, lg2.predict(Xtest_scaled)),
calc_cost(ytest, knn2.predict(Xtest_scaled)),
calc_cost(ytest, reg_forest2.predict(Xtest)),
calc_cost(ytest, reg_tree2.predict(Xtest)),
calc_cost(ytest, xg_reg2.predict(df_test2)),
calc_cost(ytest, reg_svr2.predict(Xtest))]
#Create a dataframe of the list
res2_df = pd.DataFrame(scores2_ls).transpose()
res2_df.columns=["Baseline", "Linear Regression", "Logistic Regressor", "KNN Regressor",
"Random Forest", "Decision Tree Regressor", "XGBoost Regressor", "SVR Regressor"]
res2_df
Models Evaluation For Selected Features
--------------------------------------------------------------------------------
| Baseline | Linear Regression | Logistic Regressor | KNN Regressor | Random Forest | Decision Tree Regressor | XGBoost Regressor | SVR Regressor | |
|---|---|---|---|---|---|---|---|---|
| MAE | 26281.000000 | 18621.00000 | 15726.00000 | 9040.000000 | 9061.000000 | 13423.000000 | 19286.000000 | 25041.000000 |
| MAPE | 0.562042 | 0.39743 | 0.34224 | 0.237399 | 0.236782 | 0.319929 | 0.416341 | 0.497445 |
| RMSE | 33312.000000 | 25169.00000 | 26920.00000 | 18560.000000 | 18760.000000 | 20522.000000 | 25475.000000 | 33200.000000 |
param_grid = {'bootstrap': [True], 'max_depth': [5, 10, None],
              'max_features': ['sqrt', 'log2'],  # note: 'auto' was removed in newer scikit-learn versions
              'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}
rfr = RandomForestRegressor(random_state = 42)
g_search = GridSearchCV(estimator = rfr, param_grid = param_grid,
cv = 3, n_jobs = 1, verbose = 0, return_train_score=True)
g_search.fit(X_train, y_train);
print(g_search.best_params_)
{'bootstrap': True, 'max_depth': None, 'max_features': 'log2', 'n_estimators': 15}
# Grid Search Parameters
grid_search_params = {
'colsample_bytree': [0.3, 0.7],
'learning_rate': [0.01, 0.1, 0.2, 0.5],
'n_estimators': [100],
'subsample': [0.2, 0.5, 0.8],
'max_depth': [2, 3, 5]
}
xg_grid_reg = xgb.XGBRegressor(objective= "reg:squarederror")
grid = GridSearchCV( estimator = xg_grid_reg,
param_grid = grid_search_params,
scoring = 'neg_mean_squared_error',
cv = 4, verbose = 1)
grid.fit(X, y)
Fitting 4 folds for each of 72 candidates, totalling 288 fits
GridSearchCV(cv=4,
estimator=XGBRegressor(base_score=None, booster=None,
callbacks=None, colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
gamma=None, gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_rate=None, max_bin=None,
max_cat...
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=None,
reg_alpha=None, reg_lambda=None, ...),
param_grid={'colsample_bytree': [0.3, 0.7],
'learning_rate': [0.01, 0.1, 0.2, 0.5],
'max_depth': [2, 3, 5], 'n_estimators': [100],
'subsample': [0.2, 0.5, 0.8]},
scoring='neg_mean_squared_error', verbose=1)
grid.best_params_
{'colsample_bytree': 0.3,
'learning_rate': 0.1,
'max_depth': 5,
'n_estimators': 100,
'subsample': 0.2}
grid.best_estimator_
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.3,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.1, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
#RF
rf_Reg = Pipeline(steps=[('RFregressor', RandomForestRegressor(random_state=42, n_estimators = 15,
bootstrap= True,
max_depth= None, max_features= 'log2'))])
#XGB
xg_Reg = Pipeline(
steps=[
('XGBregressor', xgb.XGBRegressor(objective= "reg:squarederror", colsample_bytree= 0.3,
learning_rate= 0.1,
max_depth= 5,
n_estimators= 100,
subsample= 0.2))
]
)
rf_Reg.fit(X_train, y_train)
xg_Reg.fit(X_train, y_train)
Pipeline(steps=[('XGBregressor',
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=0.3, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.1, max_bin=256,
max_cat_to_onehot=4, max_delta_step=0,
max_depth=5, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1,
predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...))])
# Create a data frame from a dictionary of model:score
dict_ = {'Random Forest': r2_score(y_true=y_test, y_pred=preds_forest),
'Optimal Random Forest':rf_Reg.score(X_test, y_test),
'XGBoost': r2_score(y_true=y_test, y_pred=predictions),
'Optimal XGBoost':xg_Reg.score(X_test, y_test)}
results = pd.DataFrame(dict_.items(), columns=['Model', 'Score'])
results
| Model | Score | |
|---|---|---|
| 0 | Random Forest | 0.718452 |
| 1 | Optimal Random Forest | 0.718991 |
| 2 | XGBoost | 0.368241 |
| 3 | Optimal XGBoost | 0.709933 |